A bad 2-dimensional instance for k-means++
Abstract
The k-means++ seeding algorithm is one of the most popular algorithms for choosing the initial k centers for the k-means heuristic. It is a simple sampling procedure: pick the first center uniformly at random from the given points; for i > 1, pick a point to be the i-th center with probability proportional to the squared Euclidean distance from this point to the nearest of the (i − 1) previously chosen centers. The k-means++ seeding algorithm is not only simple and fast but also gives an O(log k) approximation in expectation, as shown by Arthur and Vassilvitskii [AV07]. There are datasets [AV07,ADK09] on which this seeding algorithm gives an approximation factor of Ω(log k) in expectation. However, it is not clear from these results whether the algorithm achieves a good approximation factor with reasonably large probability (say 1/poly(k)). Brunsch and Röglin [BR11] gave a dataset on which the k-means++ seeding algorithm achieves an approximation ratio of (2/3 − ε) · log k only with probability exponentially small in k. However, this and all other known lower-bound examples [AV07,ADK09] are high dimensional, so an open problem is to understand the behavior of the algorithm on low-dimensional datasets. In this work, we give a simple two-dimensional dataset on which the seeding algorithm achieves an approximation ratio of c (for some universal constant c) only with probability exponentially small in k. This is a first step towards solving open problems posed by Mahajan et al. [MNV12] and by Brunsch and Röglin [BR11].
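The sampling procedure described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the names `sq_dist`, `kmeanspp_seed` and `kmeans_cost` are hypothetical, points are plain tuples, and the uniform fallback used when every point already coincides with a center is a choice made here, not something the abstract specifies.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seed(points, k, rng=None):
    """Pick k initial centers by D^2 sampling (k-means++ seeding).

    Hypothetical helper for illustration; `rng` is an optional
    random.Random instance so that runs can be made reproducible.
    """
    rng = rng or random.Random()
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    # First center: uniformly at random from the given points.
    centers = [rng.choice(points)]
    # d2[j] = squared distance from points[j] to its nearest chosen center.
    d2 = [sq_dist(p, centers[0]) for p in points]

    while len(centers) < k:
        if sum(d2) == 0:
            # Assumption: if every point coincides with a center,
            # fall back to a uniform choice.
            idx = rng.randrange(len(points))
        else:
            # Sample an index with probability proportional to d2.
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
        centers.append(points[idx])
        # Only the newly added center can shrink a nearest-center distance.
        d2 = [min(d, sq_dist(p, centers[-1])) for d, p in zip(d2, points)]
    return centers


def kmeans_cost(points, centers):
    """k-means objective: sum of squared distances to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)
```

Dividing `kmeans_cost` of the sampled centers by the cost of an optimal set of k centers gives the approximation ratio that the abstract discusses; repeating the seeding many times on a fixed dataset gives an empirical estimate of how often a given ratio is reached.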
Similar resources
A Bad Instance for k-Means++
k-means++ is a seeding technique for the k-means method with an expected approximation ratio of O(log k), where k denotes the number of clusters. Examples are known on which the expected approximation ratio of k-means++ is Ω(log k), showing that the upper bound is asymptotically tight. However, it remained open whether k-means++ yields an O(1)-approximation with probability 1/poly(k) or even wi...
Identification of Bad Signatures in Batches
The paper addresses the problem of bad signature identification in batch verification of digital signatures. The number of generic tests necessary to identify all bad signatures in a batch instance, is used to measure the efficiency of verifiers. The divide-and-conquer verifier DCV(x; n) is defined. The verifier identifies all bad signatures in a batch instance x of the length n by repeatedly splitting...
Identification of Bad Signatures in Batches
The paper addresses the problem of bad signature identification in batch verification of digital signatures. The number of generic tests necessary to identify all bad signatures in a batch instance, is used to measure the efficiency of verifiers. The divide-and-conquer verifier DCVα(x,n) is defined. The verifier identifies all bad signatures in a batch instance x of the length n by repeatedly s...
A sharp threshold for a random constraint satisfaction problem
We consider random instances I of a constraint satisfaction problem generalizing k-SAT: given n boolean variables, m ordered k-tuples of literals, and q “bad” clause assignments, find an assignment which does not set any of the k-tuples to a bad clause assignment. We consider the case where k = Ω(log n), and generate instance I by including every k-tuple of literals independently with probabili...
Clustering Analysis on E-commerce Transaction Based on K-means Clustering
Clustering algorithms based on density, increments, grids and the like usually show shortcomings such as poor elasticity, weak handling of high-dimensional data, sensitivity to the time sequence of the data, poor independence of parameters and weak handling of noise when facing a large number of high-dimensional transaction data. Making experiments by sampling data samples of the 300...
Journal title:
- CoRR
Volume abs/1306.4207, Issue -
Pages -
Publication date 2013